Let’s now load in that training data, but assign it to a different variable
name. We can read in an .rds file with the
readr::read_rds() function.
train_dataset <- readr::read_rds("my_train_data.rds")
train_dataset |> glimpse()
## Rows: 835
## Columns: 15
## $ R <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 14…
## $ G <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, …
## $ B <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132,…
## $ Lightness <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", …
## $ Saturation <chr> "bright", "bright", "bright", "bright", "bright", "brigh…
## $ Hue <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 1…
## $ response <dbl> 12, 10, 16, 10, 11, 16, 10, 19, 14, 25, 14, 19, 14, 38, …
## $ outcome <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ LightnessNum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ SaturationNum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ y <dbl> -1.9924302, -2.1972246, -1.6582281, -2.1972246, -2.09074…
## $ logit_R <dbl> 0.72865387, -2.17562547, 0.72865387, -2.09274551, 0.6931…
## $ logit_G <dbl> -Inf, -1.62924054, -1.40691365, -1.66965677, -3.08534443…
## $ logit_B <dbl> -2.4624338, 0.1784828, -2.7950616, 0.1996131, -2.7950616…
## $ logit_Hue <dbl> -2.36712361, 1.79175947, -1.38629436, 2.04769284, -2.047…
df <- readr::read_rds("df.rds")
viz_grid <- readr::read_rds("viz_grid.rds")
A. Intercept-only model – no INPUTS!
B. Categorical variables only – linear additive
C. Continuous variables only – linear additive
D. All categorical and continuous variables – linear additive
E. Interaction of the categorical inputs with all continuous inputs main effects
F. Add categorical inputs to all main effects and all pairwise interactions of continuous inputs
G. Interaction of the categorical inputs with all main effects and all pairwise interactions of continuous inputs
3 models with basis functions of your choice:
H. Try non-linear basis functions based on your EDA.
I. Can consider interactions of basis functions with other basis functions!
J. Can consider interactions of basis functions with the categorical inputs!
# A. Intercept-only model:
modA <- glm(outcome ~ 1, data = train_dataset, family = binomial)
# B. Categorical variables only – linear additive
modB <- glm(outcome ~ Lightness + Saturation, data = train_dataset, family = binomial)
# C. Continuous variables only – linear additive
modC <- glm(outcome ~ R + G + B + Hue, data = train_dataset, family = binomial)
# D. All categorical and continuous variables – linear additive
modD <- glm(outcome ~ Lightness + Saturation + R + G + B + Hue, data = train_dataset, family = binomial)
# E. Interaction of the categorical inputs with all continuous inputs main effects
modE <- glm(outcome ~ Lightness * R + Lightness * G + Lightness * B + Lightness * Hue +
Saturation * R + Saturation * G + Saturation * B + Saturation * Hue, data = train_dataset, family = binomial)
# F. Add categorical inputs to all main effect and all pairwise interactions of continuous inputs
modF <- glm(outcome ~ (R + G + B + Hue)^2 + Lightness + Saturation, data = train_dataset, family = binomial)
# G. Interaction of the categorical inputs with all main effect and all pairwise interactions of continuous inputs
modG <- glm(outcome ~ (Lightness + Saturation) * (R + G + B + Hue)^2, data = train_dataset, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
library(splines)
# H. Model with non-linear basis functions based on EDA
# Using natural cubic splines for continuous variables
modH <- glm(outcome ~ ns(Hue, df = 3), data = train_dataset, family = binomial)
# I. Can consider interactions of basis functions with other basis functions!
modI <- glm(outcome ~ ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3) + ns(Hue, df = 3), data = train_dataset, family = binomial)
# J. Model considering interactions of basis functions with the categorical inputs
# Interaction of spline basis functions with Lightness
modJ <- glm(outcome ~ (ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3)) * Lightness, data = train_dataset, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
modF %>% readr::write_rds("modF.rds")
modB %>% readr::write_rds("modB.rds")
Yes, it indicates that the logistic regression model fitted by glm() encountered numerical convergence issues. This warning often arises when the model is perfectly separating the classes based on the predictors, leading to predicted probabilities of 0 or 1 for some observations.
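A quick way to check whether the warning reflects (near-)perfect separation — a sketch, assuming the modG object fit above is still in the workspace — is to count how many fitted probabilities are numerically 0 or 1:

# count fitted probabilities that are numerically 0 or 1 (possible separation)
p_hat <- fitted(modG)
sum(p_hat < 1e-8 | p_hat > 1 - 1e-8)

A non-zero count suggests the model can (nearly) perfectly classify some observations, which inflates the coefficient estimates and their standard errors.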
The solution provided below first defines a function extract_metrics() which is a wrapper to the broom::glance() function. An additional model name column is added to the broom::glance() results.
extract_metrics <- function(mod_object, mod_name)
{
broom::glance(mod_object) %>%
mutate(model_name = mod_name)
}
The logistic regression model training set performance metrics are extracted for each of the 10 models fit in the previous problem.
glm_mle_results <- purrr::map2_dfr(list(modA, modB, modC, modD,
modE, modF, modG, modH, modI, modJ),
LETTERS[1:10],
extract_metrics)
print(glm_mle_results)
## # A tibble: 10 × 9
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 898. 834 -449. 900. 905. 898. 834 835
## 2 898. 834 -357. 740. 801. 714. 822 835
## 3 898. 834 -434. 878. 902. 868. 830 835
## 4 898. 834 -348. 731. 811. 697. 818 835
## 5 898. 834 -289. 707. 1014. 577. 770 835
## 6 898. 834 -315. 677. 786. 631. 812 835
## 7 898. 834 -147. 581. 1257. 295. 692 835
## 8 898. 834 -441. 890. 909. 882. 831 835
## 9 898. 834 -366. 758. 820. 732. 822 835
## 10 898. 834 -315. 770. 1101. 630. 765 835
## # ℹ 1 more variable: model_name <chr>
The AIC and BIC are visualized for the 10 models.
glm_mle_results %>%
select(model_name, AIC, BIC) %>%
pivot_longer(c(AIC, BIC)) %>%
ggplot(mapping = aes(x = model_name, y = value)) +
geom_point(size = 2) +
facet_wrap(~name, scales = 'free_y') +
theme_bw()
modG$formula
## outcome ~ (Lightness + Saturation) * (R + G + B + Hue)^2
modF$formula
## outcome ~ (R + G + B + Hue)^2 + Lightness + Saturation
modB$formula
## outcome ~ Lightness + Saturation
modD$formula
## outcome ~ Lightness + Saturation + R + G + B + Hue
The coefficients associated with the 3 best models according to BIC are shown below.
modF %>%
coefplot::coefplot(intercept = FALSE) +
theme_bw()
The statistically significant coefficients in modF are: Saturationsubdued, Saturationshaded, Saturationpure, Saturationneutral, Saturationgray, Lightnesssaturated, Lightnesspale, Lightnesslight, and Lightnessmidtone.
modB %>%
coefplot::coefplot(intercept = FALSE) +
theme_bw()
The statistically significant coefficients in modB are: Saturationsubdued, Saturationshaded, Saturationpure, Saturationneutral, Saturationgray, and Lightnesssaturated.
modD %>%
coefplot::coefplot(intercept = FALSE) +
theme_bw()
The statistically significant coefficients in modD are: Saturationsubdued, Saturationshaded, Saturationpure, Saturationneutral, and Saturationgray.
Based on the coefficient summaries provided for the top 3 models according to BIC, we can make the following comparisons:
Overlap in significant coefficients: Across all three models (modF, modB, and modD), we observe several common statistically significant coefficients related to Saturation and Lightness. Specifically, coefficients associated with various levels of Saturation (e.g., subdued, shaded, pure, neutral, and gray) and Lightness (e.g., saturated, pale, light, and midtone) appear to be significant in all three models.
Additional significant coefficients in modF: Model modF includes additional significant coefficients compared to modB and modD. These additional coefficients might be related to the interactions between the categorical inputs (Saturation and Lightness) and continuous inputs (R, G, B, and Hue) as specified in the model.
Differences in model complexity: Model modF appears to be the most complex among the top 3 models, as it includes the categorical inputs alongside all pairwise interactions of the continuous inputs. In contrast, modB and modD have simpler structures, with modB being the simplest.
Consistency in significant coefficients: Despite differences in model complexity, there is consistency in the significant coefficients related to Saturation and Lightness across all three models. This suggests that these factors play a crucial role in predicting the outcome variable, regardless of the specific model structure.
Overall, while modF captures additional interactions between categorical and continuous inputs, the significant coefficients related to Saturation and Lightness remain consistent across all three models. This consistency highlights the importance of these factors in predicting the outcome variable and suggests that they should be considered in further analyses or predictive modeling efforts.
Based on the coefficient summaries and the consistency of statistically significant coefficients across the top 3 models according to BIC, it appears that the following inputs are important predictors of the outcome variable:
Saturation: Various levels of Saturation, including subdued, shaded, pure, neutral, and gray, are consistently identified as significant predictors across all three models. This suggests that the degree of Saturation in the data significantly influences the outcome variable.
Lightness: Different categories of Lightness, such as saturated, pale, light, and midtone, are also consistently identified as significant predictors in all three models. This indicates that the brightness or darkness of colors plays a crucial role in predicting the outcome.
Interactions between Saturation, Lightness, and other continuous inputs: Model modF, which includes interactions between categorical inputs (Saturation and Lightness) and continuous inputs (R, G, B, and Hue), suggests that these interactions might capture additional predictive information. While not explicitly mentioned in the coefficient summaries, the presence of interaction terms in modF implies that the relationship between Saturation, Lightness, and other continuous inputs might be more nuanced and predictive than their main effects alone.
Conclusion: Saturation and Lightness, along with their potential interactions with other continuous inputs, emerge as important predictors of the outcome variable based on the coefficient summaries of the top 3 models. These findings highlight the significance of color characteristics in influencing the outcome and suggest that further exploration of these relationships could enhance predictive modeling efforts.
Now that you have gained insights into the relationships, it’s time to delve deeper into the uncertainty by employing Bayesian logistic regression models. In our lectures, we explored how the linear model framework can be extended to handle binary outcomes that are not continuous. This involves changing the likelihood distribution from Gaussian to Binomial and necessitates the use of a non-linear link function. Similar to ordinary linear models, the regression model is applied to a linear predictor that mimics the trend. In this Part, we will formulate the log-posterior function for logistic regression. By doing so, we will be able to directly compare and contrast this process with defining the log-posterior function for the linear model in the previous Part.
The complete probability model for logistic regression consists of the likelihood of the response \(y_n\) given the event probability \(\mu_n\), the inverse link function between the probability of the event, \(\mu_n\), and the linear predictor, \(\eta_n\), and the prior on all linear predictor model coefficients \(\boldsymbol{\beta}\). Assume that the \(\boldsymbol{\beta}\)-parameters are a-priori independent Gaussians with a shared prior mean \(\mu_\beta\) and a shared prior standard deviation \(\tau_\beta\).
The n-th observation’s linear predictor is the inner product of the n-th row of the design matrix, \(\mathbf{x}_{n,:}\), and the unknown \(\boldsymbol{\beta}\)-parameter column vector. You can assume that the number of unknown coefficients is equal to \(D+1\). We will fit Bayesian logistic regression models using the same linear predictor trend expressions that were used in the non-Bayesian logistic regression models.
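Collecting the pieces described above, the complete probability model can be written as:

\[
y_n \mid \mu_n \sim \mathrm{Bernoulli}\left( \mu_n \right), \qquad
\mu_n = \mathrm{logit}^{-1}\left( \eta_n \right) = \frac{1}{1 + \exp\left( -\eta_n \right)}, \qquad
\eta_n = \mathbf{x}_{n,:} \boldsymbol{\beta}
\]

\[
\beta_d \sim \mathrm{normal}\left( \mu_\beta, \tau_\beta \right), \quad d = 0, 1, \ldots, D
\]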
For this analysis, we have decided to select two models for Bayesian logistic regression: modF and modD. These models were chosen based on their performance in terms of BIC, a criterion that balances goodness of fit with model complexity.
modF includes all categorical inputs (Lightness and Saturation) along with all main effects and pairwise interactions of continuous inputs (R, G, B, and Hue). This model is advantageous because it captures potential interactions between the continuous variables, providing a more comprehensive understanding of their combined influence on the outcome. Additionally, including all pairwise interactions allows for a more flexible modeling approach, potentially capturing nonlinear relationships between the predictors and the outcome.
On the other hand, modD incorporates both categorical and continuous variables in a linear additive manner. While it may not capture interactions between continuous variables as explicitly as modF, it still provides a solid foundation for examining the independent effects of each predictor on the outcome. This model is particularly valuable for assessing the individual contributions of categorical and continuous variables to the logistic regression model.
By selecting these models, we aim to explore the impact of different model specifications on the Bayesian logistic regression framework, gaining insights into how the choice of predictors and their interactions influence model inference and prediction.
I revisited this stage after encountering issues with the
my_laplace function due to infinite values generated by
modF and modG. As a result, I’ve decided to
proceed with models modD and modB, which rank
among the top three best models based on BIC scores. I encountered
difficulties when applying the my_laplace function to
modF and modG because the calculated values
for mu were either very large or very small, making the
process impractical.
Xmat_F <- model.matrix( modF$formula, data = train_dataset )
Xmat_D <- model.matrix( modD$formula, data = train_dataset )
Xmat_B <- model.matrix( modB$formula, data = train_dataset )
print("The coefficient names associated with modB:")
## [1] "The coefficient names associated with modB:"
modB %>% coef() %>% names()
## [1] "(Intercept)" "Lightnessdeep" "Lightnesslight"
## [4] "Lightnessmidtone" "Lightnesspale" "Lightnesssaturated"
## [7] "Lightnesssoft" "Saturationgray" "Saturationmuted"
## [10] "Saturationneutral" "Saturationpure" "Saturationshaded"
## [13] "Saturationsubdued"
print("The coefficient names associated with modD:")
## [1] "The coefficient names associated with modD:"
modD %>% coef() %>% names()
## [1] "(Intercept)" "Lightnessdeep" "Lightnesslight"
## [4] "Lightnessmidtone" "Lightnesspale" "Lightnesssaturated"
## [7] "Lightnesssoft" "Saturationgray" "Saturationmuted"
## [10] "Saturationneutral" "Saturationpure" "Saturationshaded"
## [13] "Saturationsubdued" "R" "G"
## [16] "B" "Hue"
The column names associated with the Xmat_B and Xmat_D design matrices are displayed below; they match the coefficient names printed above.
Xmat_B %>% colnames()
## [1] "(Intercept)" "Lightnessdeep" "Lightnesslight"
## [4] "Lightnessmidtone" "Lightnesspale" "Lightnesssaturated"
## [7] "Lightnesssoft" "Saturationgray" "Saturationmuted"
## [10] "Saturationneutral" "Saturationpure" "Saturationshaded"
## [13] "Saturationsubdued"
Xmat_D %>% colnames()
## [1] "(Intercept)" "Lightnessdeep" "Lightnesslight"
## [4] "Lightnessmidtone" "Lightnesspale" "Lightnesssaturated"
## [7] "Lightnesssoft" "Saturationgray" "Saturationmuted"
## [10] "Saturationneutral" "Saturationpure" "Saturationshaded"
## [13] "Saturationsubdued" "R" "G"
## [16] "B" "Hue"
The code chunk below uses the all.equal() function to confirm the names are the same between the non-Bayesian models and the design matrices:
purrr::map2_lgl(purrr::map(list(modD,
modB),
~names(coef(.))),
purrr::map(list(Xmat_D,
Xmat_B),
colnames),
all.equal)
## [1] TRUE TRUE
The log-posterior function you will program requires the design matrix, the observed output vector, and the prior specification. The lists below adhere to the same naming convention as the design matrices: info_F holds the details for model F, info_D for model D, and info_B for model B.
info_F <- list(
yobs = train_dataset$outcome,
design_matrix = Xmat_F,
mu_beta = 0,
tau_beta = 4.5
)
info_D <- list(
yobs = train_dataset$outcome,
design_matrix = Xmat_D,
mu_beta = 0,
tau_beta = 4.5
)
info_B <- list(
yobs = train_dataset$outcome,
design_matrix = Xmat_B,
mu_beta = 0,
tau_beta = 4.5
)
Define the log-posterior function for logistic regression, logistic_logpost(). The first argument to logistic_logpost() is the vector of unknowns and the second argument is the list of required information.
Do you need to separate the β-parameters from the unknowns vector? No. The only unknowns in the logistic regression model are the regression coefficients, so the unknowns vector consists only of the β-parameters!
logistic_logpost <- function(unknowns, my_info)
{
# extract the design matrix and assign to X
X <- my_info$design_matrix
# calculate the linear predictor
eta <- as.vector( X %*% as.matrix(unknowns))
# calculate the event probability
mu <- boot::inv.logit(eta)
# evaluate the log-likelihood
log_lik <- sum(dbinom(x = my_info$yobs,
size = 1,
prob = mu,
log = TRUE))
# evaluate the log-prior
log_prior <- sum(dnorm(x = unknowns,
mean = my_info$mu_beta,
sd = my_info$tau_beta,
log = TRUE))
# sum together
log_lik + log_prior
}
logistic_logpost(rep(0, ncol(Xmat_D)), info_D)
## [1] -619.9692
logistic_logpost(rep(0, ncol(Xmat_B)), info_B)
## [1] -610.2771
The my_laplace() function is provided in the code chunk below:
my_laplace <- function(start_guess, logpost_func, ...)
{
# code adapted from the `LearnBayes` function `laplace()`
fit <- optim(start_guess,
logpost_func,
gr = NULL,
...,
method = "BFGS",
hessian = TRUE,
control = list(fnscale = -1, maxit = 5001))
mode <- fit$par
post_var_matrix <- -solve(fit$hessian)
p <- length(mode)
int <- p/2 * log(2 * pi) + 0.5 * log(det(post_var_matrix)) + logpost_func(mode, ...)
# package all of the results into a list
list(mode = mode,
var_matrix = post_var_matrix,
log_evidence = int,
converge = ifelse(fit$convergence == 0,
"YES",
"NO"),
iter_counts = as.numeric(fit$counts[1]))
}
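For reference, the log_evidence value returned by my_laplace() is the Laplace approximation to the log marginal likelihood. With \(p\) parameters, posterior mode \(\hat{\boldsymbol{\beta}}\), and posterior covariance matrix \(\boldsymbol{\Sigma}\) (the negated inverse Hessian at the mode), it is:

\[
\log p\left( \mathbf{y} \right) \approx \frac{p}{2} \log\left( 2\pi \right) + \frac{1}{2} \log \det\left( \boldsymbol{\Sigma} \right) + \log p\left( \mathbf{y}, \hat{\boldsymbol{\beta}} \right)
\]

where the last term is the un-normalized log-posterior evaluated at the mode, exactly as computed on the int line of the function.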
Assign the results to laplace_ objects whose names are consistent with the design matrices and lists of required information. Thus, laplace_D must correspond to the info_D and Xmat_D objects, and laplace_B to info_B and Xmat_B.
The initial guess does not matter. Logistic regression has a single posterior mode! The optimization will converge as long as we use enough iterations.
laplace_D <- my_laplace(rep(0, ncol(Xmat_D)), logistic_logpost, info_D)
laplace_B <- my_laplace(rep(0, ncol(Xmat_B)), logistic_logpost, info_B)
The code chunk below confirms all models converged:
purrr::map_chr(list(laplace_B, laplace_D),
'converge')
## [1] "YES" "YES"
The laplace_D and laplace_B objects are the Bayesian versions of modD and modB, respectively, that I fit previously.
laplace_B %>% readr::write_rds("laplace_B.rds")
Let’s identify the best model using the Evidence-based approach! The code chunk below follows the procedure from an earlier assignment. The log-evidence values are extracted from each model’s laplace_ object. The posterior model weights are then calculated by undoing the log and dividing by the sum of the Evidence.
mod_log_evidences <- purrr::map_dbl(list(laplace_B, laplace_D),
'log_evidence')
all_model_weights <- exp( mod_log_evidences ) / sum(exp(mod_log_evidences))
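Directly exponentiating log-evidence values can underflow or overflow when they are large in magnitude. A common safeguard — a sketch equivalent to the calculation above — is to subtract the maximum log-evidence before exponentiating, since the shift cancels in the ratio:

# numerically stable weights: the max-shift cancels in the ratio
shifted_log_ev <- mod_log_evidences - max(mod_log_evidences)
all_model_weights <- exp(shifted_log_ev) / sum(exp(shifted_log_ev))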
The posterior model weights are compiled into a dataframe. The bar chart below shows the model weights per model.
tibble::tibble(
model_name = c("B", "D"),
post_model_weight = all_model_weights
) %>%
ggplot(mapping = aes(x = model_name, y = post_model_weight)) +
geom_bar(stat = 'identity') +
coord_cartesian(ylim = c(0,1)) +
theme_bw()
For this application the log-Evidence based approach selects the simpler model B as the best, even though modB was only the 2nd best model according to BIC for the non-Bayesian glm fits.
To visualize the regression coefficient posterior summary statistics for the best model (model B), we can use the coefficient summary obtained from the Laplace approximation.
The plot below shows the posterior mode of each coefficient along with an approximate 95% uncertainty interval, based on the posterior standard deviations from the Laplace approximation for model B.
# Extract the mode and variance matrix for the best model (model B)
mode_B <- laplace_B$mode
var_matrix_B <- laplace_B$var_matrix
# Calculate standard errors from the diagonal of the variance matrix
se_B <- sqrt(diag(var_matrix_B))
# Create a dataframe with coefficient names, mode, and standard errors
coef_summary_B <- data.frame(
coefficient = colnames(Xmat_B),
mode = mode_B,
se = se_B
)
# Visualize the regression coefficient posterior summary statistics for model B
ggplot(coef_summary_B, aes(x = coefficient, y = mode)) +
geom_point() +
geom_errorbar(aes(ymin = mode - 1.96 * se, ymax = mode + 1.96 * se), width = 0.2) +
coord_flip() +
theme_bw() +
labs(title = "Regression Coefficient Posterior Summary Statistics (Model B)")
The plot depicting the regression coefficient posterior summary statistics for model B closely resembles the coefficient summary statistics obtained for the non-Bayesian model modB. This similarity underscores the consistency between the Bayesian and non-Bayesian approaches in identifying the important predictors and estimating their effects on the outcome variable. Both analyses provide insights into the posterior mode and uncertainty (represented by the approximate 95% posterior intervals) associated with each coefficient, facilitating a comprehensive understanding of the model’s predictive performance and explanatory power.
Examine the predictive trends of the models to better interpret their behavior. We will visualize the trends on a specifically designed prediction grid. We will use the same grid computed in the last part. The prediction grid is reloaded as viz_grid.
viz_grid %>% glimpse()
## Rows: 784
## Columns: 6
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
The code chunk below starts the function for you; it uses just two input arguments, mvn_result and num_samples. Adapt generate_lm_post_samples() to define generate_glm_post_samples(). We cannot use generate_lm_post_samples() directly because it includes the back-transformation from φ to σ, and the logistic regression model does NOT include σ. Since the only unknowns are the β-parameters, length_beta is simply the length of the posterior mode vector.
generate_glm_post_samples <- function(mvn_result, num_samples)
{
# specify the number of unknown beta parameters
length_beta <- length(mvn_result$mode)
# generate the random samples
beta_samples <- MASS::mvrnorm(n = num_samples,
mu = mvn_result$mode,
Sigma = mvn_result$var_matrix)
# change the data type and name
beta_samples %>%
as.data.frame() %>% tibble::as_tibble() %>%
purrr::set_names(sprintf("beta_%02d", (1:length_beta) - 1))
}
The function post_logistic_pred_samples() is in the code chunk below. It consists of two input arguments, Xnew and Bmat. Xnew is a test design matrix where rows correspond to prediction points. The matrix Bmat stores the posterior samples of the β-parameters, where each row is a posterior sample and each column is a parameter. The function then calculates the posterior samples of the event probability.
post_logistic_pred_samples <- function(Xnew, Bmat)
{
# calculate the linear predictor at all prediction points and posterior samples
eta_mat <- Xnew %*% t(Bmat)
# calculate the event probability
mu_mat <- boot::inv.logit(eta_mat)
# book keeping
list(eta_mat = eta_mat, mu_mat = mu_mat)
}
The code chunk below defines a function summarize_logistic_pred_from_laplace() which manages the actions necessary to summarize posterior predictions of the event probability. The first argument, mvn_result, is the Laplace Approximation object. The second argument is the test design matrix, Xtest, and the third argument, num_samples, is the number of posterior samples to make. This function generates posterior prediction samples of the linear predictor and the event probability, and summarizes the posterior predictions of the event probability.
generate_glm_post_samples(laplace_D, 784)
## # A tibble: 784 × 17
## beta_00 beta_01 beta_02 beta_03 beta_04 beta_05 beta_06 beta_07 beta_08
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1.95 -0.121 1.34 1.08 2.49 -0.296 1.85 4.88 1.04
## 2 -1.04 -0.497 0.0589 0.404 0.683 -0.395 0.736 4.03 1.46
## 3 0.112 -0.559 2.01 1.80 2.58 0.329 2.11 3.11 -0.0119
## 4 -0.172 -0.0646 2.27 1.64 2.99 0.493 2.26 3.55 0.235
## 5 -0.416 -0.588 1.55 1.41 1.83 0.0591 1.77 3.19 0.108
## 6 -1.14 -0.977 0.416 0.339 0.815 -0.815 0.814 4.09 0.563
## 7 -0.609 0.102 1.61 2.01 2.94 0.350 1.87 3.60 0.716
## 8 -0.723 -0.371 2.87 1.78 3.22 0.278 2.48 4.19 1.34
## 9 -0.867 -0.460 0.984 1.40 1.68 -0.216 2.08 4.42 0.206
## 10 -1.92 -0.665 -0.155 0.626 1.17 -0.194 0.364 4.64 0.409
## # ℹ 774 more rows
## # ℹ 8 more variables: beta_09 <dbl>, beta_10 <dbl>, beta_11 <dbl>,
## # beta_12 <dbl>, beta_13 <dbl>, beta_14 <dbl>, beta_15 <dbl>, beta_16 <dbl>
summarize_logistic_pred_from_laplace <- function(mvn_result, Xtest, num_samples)
{
# generate posterior samples of the beta parameters
betas <- generate_glm_post_samples(mvn_result, num_samples)
# data type conversion
betas <- as.matrix(betas)
# make posterior predictions on the test set
pred_test <- post_logistic_pred_samples(Xtest, betas)
# calculate summary statistics on the posterior predicted probability
# summarize over the posterior samples
# posterior mean, should you summarize along rows (rowMeans) or
# summarize down columns (colMeans) ???
mu_avg <- rowMeans(pred_test$mu_mat)
# posterior quantiles
mu_q05 <- apply(pred_test$mu_mat, 1, stats::quantile, probs = 0.05)
mu_q95 <- apply(pred_test$mu_mat, 1, stats::quantile, probs = 0.95)
# book keeping
tibble::tibble(
mu_avg = mu_avg,
mu_q05 = mu_q05,
mu_q95 = mu_q95
) %>%
tibble::rowid_to_column("pred_id")
}
Create the design matrices using the viz_grid dataframe:
Xviz_D <- model.matrix(~ Lightness + Saturation + R + G + B + Hue, data = viz_grid )
Xviz_B <- model.matrix(~ Lightness + Saturation, data = viz_grid )
Xviz_D %>% dim()
## [1] 784 17
Xviz_B %>% dim()
## [1] 784 13
Summarize the posterior predicted event probability associated with the two models on the visualization grid. The prediction summaries are computed in the code chunk below:
set.seed(8123)
post_pred_summary_B <- summarize_logistic_pred_from_laplace(laplace_B, Xviz_B, 2500)
post_pred_summary_D <- summarize_logistic_pred_from_laplace(laplace_D, Xviz_D, 2500)
Print the dimensions of the objects to the screen:
post_pred_summary_B %>% dim()
## [1] 784 4
post_pred_summary_D %>% dim()
## [1] 784 4
The function below creates a figure which visualizes the posterior predictive summary statistics of the event probability for a single model across the R, G, and B input variables.
The plots include the predicted mean event probability and its uncertainty interval.
viz_bayes_logpost_preds <- function(post_pred_summary, input_df)
{
post_pred_summary %>%
left_join(input_df %>% tibble::rowid_to_column('pred_id'),
by = 'pred_id') %>%
mutate(B_range = cut(B, breaks = c(0, 50, 100, 150, 200, 255),
labels = c("0-50", "51-100", "101-150", "151-200", "201-255"),
include.lowest = TRUE)) %>%
ggplot(mapping = aes(x = G, y = mu_avg)) +
geom_point(mapping = aes(color = R), size = 2.0) +
facet_wrap(~ B_range, labeller = 'label_both') +
geom_ribbon(mapping = aes(ymin = mu_q05,
ymax = mu_q95,
group = interaction(R),
fill = R),
alpha = 0.2) +
labs(x = "G", y = "event probability") +
theme_bw()
}
viz_bayes_logpost_preds(post_pred_summary_B, viz_grid)
Use the viz_bayes_logpost_preds() function to visualize the posterior predictive trends of the event probability for the second model, model D:
viz_bayes_logpost_preds(post_pred_summary_D, viz_grid)
The predictive trends observed in the two chosen generalized linear models exhibit notable disparities. While modB provides a clear depiction of the relationship between R, G, and B values, with distinct confidence interval ribbons for R, facilitating the identification of combinations with higher popularity probability, modD presents a more scattered visualization. Despite this, the confidence intervals in modD offer clarity regarding the association between lower R, G, and B values and higher popularity probabilities. Nonetheless, drawing conclusive insights from the graphs generated by modD proves to be more challenging compared to modB.
viz_bayes_logpost_preds_HSL <- function(post_pred_summary, input_df)
{
post_pred_summary %>%
left_join(input_df %>% tibble::rowid_to_column('pred_id'),
by = 'pred_id') %>%
ggplot(mapping = aes(x = Hue, y = mu_avg)) +
geom_point(mapping = aes(color = Lightness), size = 2.0) +
facet_wrap(~ Saturation, labeller = 'label_both') +
geom_ribbon(mapping = aes(ymin = mu_q05,
ymax = mu_q95,
group = interaction(Lightness),
fill = Lightness),
alpha = 0.2) +
labs(x = "Hue", y = "event probability") +
theme_bw()
}
viz_bayes_logpost_preds_HSL(post_pred_summary_B, viz_grid)
viz_bayes_logpost_preds_HSL(post_pred_summary_D, viz_grid)
Once more, there’s a lack of consistency in the event probability between the two top-performing models. ModB offers a comprehensive insight into how HSL values impact the event probability, presenting a clearer picture overall. However, for the gray and bright Saturation levels both models align in their predictions, indicating some consistency; modD appears to generalize well in this regard.
Instructions:
1. You must train, assess, tune, and compare more complex methods via resampling.
2. You may use either caret or tidymodels to handle the preprocessing, training, testing, and evaluation.
The code chunk below imports the caret package
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
The caret package prefers the binary outcome to be organized as a factor data type rather than an integer. In the Data Exploration part of the project, the categorical Lightness and Saturation columns were encoded as integers. The input variables R, G, B, Hue, LightnessNum, and SaturationNum are therefore reformatted to be consistent with the structure caret prefers.
train_dataset %>% glimpse()
## Rows: 835
## Columns: 15
## $ R <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 14…
## $ G <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, …
## $ B <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132,…
## $ Lightness <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", …
## $ Saturation <chr> "bright", "bright", "bright", "bright", "bright", "brigh…
## $ Hue <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 1…
## $ response <dbl> 12, 10, 16, 10, 11, 16, 10, 19, 14, 25, 14, 19, 14, 38, …
## $ outcome <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ LightnessNum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ SaturationNum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ y <dbl> -1.9924302, -2.1972246, -1.6582281, -2.1972246, -2.09074…
## $ logit_R <dbl> 0.72865387, -2.17562547, 0.72865387, -2.09274551, 0.6931…
## $ logit_G <dbl> -Inf, -1.62924054, -1.40691365, -1.66965677, -3.08534443…
## $ logit_B <dbl> -2.4624338, 0.1784828, -2.7950616, 0.1996131, -2.7950616…
## $ logit_Hue <dbl> -2.36712361, 1.79175947, -1.38629436, 2.04769284, -2.047…
B. Categorical variables only – linear additive
modB <- glm(outcome ~ Lightness + Saturation, data = train_dataset, family = binomial)
Basis functions can also be interacted with other basis functions. The natural spline function ns() comes from the splines package.
library(splines)
modI <- glm(outcome ~ ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3) + ns(Hue, df = 3), data = train_dataset, family = binomial)
library(caret)
The caret package prefers the binary outcome to be organized as a factor data type rather than an integer. The data set is reformatted in the code chunk below. The binary outcome is converted to a factor with values 'popular' and 'unpopular'. The first level is forced to be 'popular' (the event) to be consistent with the caret preferred structure.
# Recode the binary outcome column as a factor with the event level first
df_caret <- train_dataset %>%
mutate(outcome = ifelse(outcome == 1, 'popular', 'unpopular')) %>%
mutate(outcome = factor(outcome, levels = c("popular", "unpopular"))) %>%
select(R, G, B, Hue, Lightness, Saturation, outcome)
glimpse(df_caret)
## Rows: 835
## Columns: 7
## $ R <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 144, …
## $ G <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, 163…
## $ B <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132, 50…
## $ Hue <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 12, …
## $ Lightness <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", "da…
## $ Saturation <chr> "bright", "bright", "bright", "bright", "bright", "bright",…
## $ outcome <fct> popular, popular, popular, unpopular, unpopular, unpopular,…
Specify the resampling scheme to be 10 fold with 3 repeats. Assign the result of the trainControl() function to the my_ctrl object. Specify the primary performance metric to be ‘Accuracy’ and assign that to the my_metric object.
my_ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)
my_metric <- "Accuracy"
I will start by training, assessing, and tuning an elastic net model using the default caret tuning grid. For this, I’ll use the caret::train() function with the formula interface, specifying a model consistent with model H. This model should interact the categorical input with the linear main continuous effects, interaction between continuous, and quadratic continuous features. I need to ensure the binary outcome is named ‘outcome’ instead of ‘y’ in the formula. I’ll set the method argument to ‘glmnet’ and the metric argument to my_metric. Although the inputs were standardized, I’ll instruct caret to standardize the features by setting the preProcess argument to c(‘center’, ‘scale’). This will allow me to practice standardizing inputs. Additionally, I’ll assign the trControl argument to the my_ctrl object.
It’s important to note that even though I’m using glmnet to fit the model, caret does not require me to organize the design matrix as required by glmnet. Therefore, I don’t need to remove the intercept when defining the formula to caret::train().
After training, assessing, and tuning the glmnet elastic net model, I’ll assign the result to the enet_default object and display the result to the screen. Then, I’ll analyze which tuning parameter combinations are considered the best. I’ll also determine whether the best set of tuning parameters is more consistent with Lasso or Ridge regression.
The random seed is reset before each call to train(). This is critical when using caret to train multiple models on the same data. We want each model to use the same resample splits. Setting the seed before each call to train() makes sure that is indeed the case. If we do not set the seed the resample fold training sets and test sets will be different! We will therefore not compare the models on the exact same data.
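As a quick illustration of why the seed matters, caret's createFolds() helper (which train() uses internally) produces identical fold assignments whenever the same seed is set beforehand. This is only a sketch; it assumes the df_caret dataframe from above is in memory:

```r
library(caret)

# Same seed before each call -> identical fold assignments
set.seed(2001)
folds_a <- createFolds(df_caret$outcome, k = 10)

set.seed(2001)
folds_b <- createFolds(df_caret$outcome, k = 10)

# TRUE: models trained after the same seed reset share resample splits
identical(folds_a, folds_b)
```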
set.seed(2001)
# Train and tune the model using caret::train()
modD_default <- train(
outcome ~ Lightness + Saturation + R + G + B + Hue,
data = df_caret,
method = "glm",
family = binomial,
preProcess = c("center", "scale"),
trControl = my_ctrl
)
modD_default
## Generalized Linear Model
##
## 835 samples
## 6 predictor
## 2 classes: 'popular', 'unpopular'
##
## Pre-processing: centered (16), scaled (16)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8195765 0.3733741
set.seed(2001)
# Train and tune the model using caret::train()
modF_default <- train(
outcome ~ (R + G + B + Hue)^2 + Lightness + Saturation,
data = df_caret,
method = "glm",
family = binomial,
preProcess = c("center", "scale"),
trControl = my_ctrl
)
modF_default
## Generalized Linear Model
##
## 835 samples
## 6 predictor
## 2 classes: 'popular', 'unpopular'
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8132078 0.3887981
Based on resampled accuracy, modD (0.8196) slightly outperforms modF (0.8132).
set.seed(2001)
# Train and tune the model using caret::train()
modB_default <- train(
outcome ~ Lightness + Saturation,
data = df_caret,
method = "glm",
family = binomial,
preProcess = c("center", "scale"),
trControl = my_ctrl
)
modB_default
## Generalized Linear Model
##
## 835 samples
## 2 predictor
## 2 classes: 'popular', 'unpopular'
##
## Pre-processing: centered (12), scaled (12)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8227894 0.370362
The resampled accuracy of this model is higher than that of both modF and modD.
modB_default %>% readr::write_rds("modB_default.rds")
set.seed(2001)
# Train and tune the model using caret::train()
modI_default <- train(
outcome ~ ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3) + ns(Hue, df = 3),
data = df_caret,
method = "glm",
family = binomial,
preProcess = c("center", "scale"),
trControl = my_ctrl
)
modI_default
## Generalized Linear Model
##
## 835 samples
## 4 predictor
## 2 classes: 'popular', 'unpopular'
##
## Pre-processing: centered (12), scaled (12)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7948533 0.2758227
The resampled accuracy of this model is lower than that of all models explored so far.
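Because all four glm models were trained after resetting the same seed, their fold-level performance can be compared directly with caret::resamples(). A sketch, assuming the model objects above are still in memory:

```r
# Collect the resample results of the four glm models
model_results <- resamples(list(B = modB_default,
                                D = modD_default,
                                F = modF_default,
                                I = modI_default))

# Per-model distribution of Accuracy and Kappa across the 30 resamples
summary(model_results)

# Visual comparison of the accuracy distributions
dotplot(model_results, metric = "Accuracy")
```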
set.seed(2001)
modF_glmn_default <- train( outcome ~ (R + G + B + Hue)^2 + Lightness + Saturation,
data = df_caret,
method = 'glmnet',
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
modF_glmn_default
## glmnet
##
## 835 samples
## 6 predictor
## 2 classes: 'popular', 'unpopular'
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.10 0.0003523518 0.8255860 0.4025064
## 0.10 0.0035235183 0.8239798 0.3821878
## 0.10 0.0352351830 0.8227894 0.3703620
## 0.55 0.0003523518 0.8231764 0.4002840
## 0.55 0.0035235183 0.8231766 0.3785039
## 0.55 0.0352351830 0.8227894 0.3703620
## 1.00 0.0003523518 0.8176016 0.3909140
## 1.00 0.0035235183 0.8239798 0.3794697
## 1.00 0.0352351830 0.8227894 0.3703620
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.0003523518.
set.seed(2001)
modB_glmn_default <- train( outcome ~ Lightness + Saturation,
data = df_caret,
method = 'glmnet',
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
modB_glmn_default
## glmnet
##
## 835 samples
## 2 predictor
## 2 classes: 'popular', 'unpopular'
##
## Pre-processing: centered (12), scaled (12)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 752, 751, 752, 752, 751, 752, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.10 0.0003523518 0.8227894 0.370362
## 0.10 0.0035235183 0.8227894 0.370362
## 0.10 0.0352351830 0.8227894 0.370362
## 0.55 0.0003523518 0.8227894 0.370362
## 0.55 0.0035235183 0.8227894 0.370362
## 0.55 0.0352351830 0.8227894 0.370362
## 1.00 0.0003523518 0.8227894 0.370362
## 1.00 0.0035235183 0.8227894 0.370362
## 1.00 0.0352351830 0.8227894 0.370362
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.03523518.
Between the two elastic net models, the glmnet version of modF achieves slightly higher accuracy than the glmnet version of modB.
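To judge whether the selected tuning parameters behave more like Lasso or Ridge, the coefficients at the chosen lambda can be inspected through the glmnet final model stored by caret. A sketch, assuming modF_glmn_default is in memory:

```r
# Extract the coefficients at the lambda selected by resampling
best  <- modF_glmn_default$bestTune
coefs <- coef(modF_glmn_default$finalModel, s = best$lambda)

# Count how many coefficients were shrunk exactly to zero
sum(coefs == 0)

# With alpha = 0.1 the penalty is mostly Ridge-like, so few coefficients
# are expected to be exactly zero.
```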
library(caret)
df3_caret <- train_dataset %>%
mutate(outcome = ifelse(outcome == 1, 'popular', 'unpopular')) %>%
mutate(outcome = factor(outcome, levels = c("popular", "unpopular"))) %>%
select(R, G, B, Hue, Saturation, Lightness, outcome)
df3_caret %>% glimpse()
## Rows: 835
## Columns: 7
## $ R <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 144, …
## $ G <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, 163…
## $ B <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132, 50…
## $ Hue <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 12, …
## $ Saturation <chr> "bright", "bright", "bright", "bright", "bright", "bright",…
## $ Lightness <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", "da…
## $ outcome <fct> popular, popular, popular, unpopular, unpopular, unpopular,…
Training a neural network to classify the binary outcome, outcome, with respect to all inputs. Since the neural network will attempt to create non-linear relationships, I’m not specifying interaction between the inputs.
The df3_caret dataframe consists only of the inputs and the binary outcome. Thus, we can create a formula for all inputs via the dot operator, outcome ~ . The dot operator streamlines the formula interface when the dataframe consists ONLY of inputs and the output.
Train, assess, and tune the nnet neural network with the defined resampling scheme. Assign the result to the nnet_default object and print the result to the screen.
my_ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)
my_metric_p3 <- "Accuracy"
set.seed(1234)
nnet_default <- train( outcome ~ .,
data = df3_caret,
method = 'nnet',
metric = my_metric_p3,
preProcess = c("center", "scale"),
trControl = my_ctrl,
trace = FALSE)
nnet_default
## Neural Network
##
## 835 samples
## 6 predictor
## 2 classes: 'popular', 'unpopular'
##
## Pre-processing: centered (16), scaled (16)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.7864653 0.2034894
## 1 1e-04 0.7940863 0.2150129
## 1 1e-01 0.8176093 0.3743426
## 3 0e+00 0.8092469 0.4158233
## 3 1e-04 0.8076219 0.4173052
## 3 1e-01 0.8331720 0.4839001
## 5 0e+00 0.8036106 0.4110662
## 5 1e-04 0.7984761 0.4180963
## 5 1e-01 0.8315325 0.4867369
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 3 and decay = 0.1.
As shown above, the best default neural network has size = 3 and decay = 0.1, with accuracy 0.8331720 and Kappa 0.4839001. A network with only 3 hidden units is not large, so it is worth trying more hidden units to see if the results can be improved further.
plot(nnet_default, xTrans=log)
The code chunk below defines a custom tuning grid focused on tuning the decay parameter. The decay parameter is just the regularization strength and thus is similar to lambda used in elastic net with the ridge penalty!
nnet_grid <- expand.grid(size = c(5, 10, 20),
decay = exp(seq(-6, 2, length.out=11)))
nnet_grid %>% dim()
## [1] 33 2
The more refined neural network tuning grid is used below.
set.seed(1234)
nnet_tune <- train( outcome ~ .,
data = df3_caret,
method = 'nnet',
metric = my_metric_p3,
tuneGrid = nnet_grid,
preProcess = c("center", "scale"),
trControl = my_ctrl,
trace = FALSE)
nnet_tune$bestTune
## size decay
## 28 20 0.1353353
nnet_tune
## Neural Network
##
## 835 samples
## 6 predictor
## 2 classes: 'popular', 'unpopular'
##
## Pre-processing: centered (16), scaled (16)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 5 0.002478752 0.8196174 0.4603789
## 5 0.005516564 0.8036253 0.4067763
## 5 0.012277340 0.8143683 0.4501354
## 5 0.027323722 0.8279227 0.4905871
## 5 0.060810063 0.8244321 0.4762968
## 5 0.135335283 0.8255656 0.4638801
## 5 0.301194212 0.8320104 0.4724437
## 5 0.670320046 0.8275731 0.4330740
## 5 1.491824698 0.8212094 0.3688582
## 5 3.320116923 0.8212094 0.3656789
## 5 7.389056099 0.7848542 0.1359134
## 10 0.002478752 0.8039928 0.4392154
## 10 0.005516564 0.8055900 0.4497285
## 10 0.012277340 0.8168157 0.4773341
## 10 0.027323722 0.8223621 0.4889601
## 10 0.060810063 0.8248145 0.4917687
## 10 0.135335283 0.8339943 0.5054810
## 10 0.301194212 0.8292083 0.4742128
## 10 0.670320046 0.8291798 0.4444163
## 10 1.491824698 0.8211998 0.3704140
## 10 3.320116923 0.8220078 0.3685276
## 10 7.389056099 0.8076216 0.2819806
## 20 0.002478752 0.7980069 0.4371546
## 20 0.005516564 0.8136648 0.4695291
## 20 0.012277340 0.8160224 0.4779194
## 20 0.027323722 0.8212383 0.4886769
## 20 0.060810063 0.8255842 0.4970033
## 20 0.135335283 0.8363562 0.5176417
## 20 0.301194212 0.8319720 0.4806440
## 20 0.670320046 0.8303844 0.4448305
## 20 1.491824698 0.8211998 0.3684223
## 20 3.320116923 0.8224094 0.3695368
## 20 7.389056099 0.8167919 0.3353003
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 20 and decay = 0.1353353.
The default neural network achieved accuracy 0.8331720 (Kappa 0.4839001) with size = 3 and decay = 0.1. The tuned neural network achieves accuracy 0.8363562 (Kappa 0.5176417) with size = 20 and decay = 0.1353353. The tuned neural network therefore performs slightly better than the default.
plot(nnet_tune, xTrans=log)
The tuning results are visualized above for the refined, larger neural network. By default caret selects whichever model has the best overall performance, but the plot reveals very little difference between the smaller network with 5 hidden units and the overall best with 20 hidden units. I would prefer the less complex neural network in this case. The figure also shows that very low regularization strength (left side of the plot) and very high regularization strength (right side of the plot) both perform worse than the "middle ground" of moderate regularization: the left side corresponds to overfit, overly complex models and the right side to underfit, overly simple models.
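caret provides a tolerance() helper that selects the simplest model within a given percentage of the best performance, which formalizes the preference for the less complex network. A sketch, accepting up to a 2% loss in accuracy:

```r
# Index of the simplest tuning combination within 2% of the best accuracy
simpler_row <- tolerance(nnet_tune$results, metric = "Accuracy",
                         tol = 2, maximize = TRUE)

# Inspect the selected, less complex candidate
nnet_tune$results[simpler_row, c("size", "decay", "Accuracy")]
```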
The pred_viz_nnet_probs dataframe is column-bound to the viz_grid dataframe. The new object, viz_nnet_df, provides the class predicted probabilities for each input combination in the visualization grid.
pred_viz_nnet_probs <- predict( nnet_default, newdata = viz_grid, type = 'prob' )
viz_nnet_df <- viz_grid %>% bind_cols(pred_viz_nnet_probs)
viz_nnet_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular <dbl> 2.318684e-02, 9.123897e-02, 4.223825e-02, 2.645848e-01, 8.9…
## $ unpopular <dbl> 0.97681316, 0.90876103, 0.95776175, 0.73541515, 0.10568882,…
The glimpse reveals that the popular column stores the predicted event probability. The predicted event probability is visualized below in a manner consistent with the viz_bayes_logpost_preds() visualizations: with respect to G, colored by R, and faceted by binned B.
# Define custom breaks for B
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Add breaks and labels to the B variable
viz_nnet_df$B_range <- cut(viz_nnet_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_nnet_df %>%
ggplot(mapping = aes(x = G, y = popular)) +
geom_point(mapping = aes(color = R), size = 2.0) +
facet_wrap(~B_range, labeller = 'label_both') +
theme_bw()
I preferred analysis outcome (popularity) by RGB and HSL color models. By visual interpretation of the outcome predictions, I aimed to identify color combinations within the RGB color model that exhibit a high level of popularity. This information could be valuable for PPG company in their efforts to create new colors.
By studying the combinations of red, green, and blue values that are associated with higher popularity, PPG can potentially synthesize new colors that are more likely to resonate with consumers.
viz_nnet_df %>%
ggplot(mapping = aes(x = Hue, y = popular)) +
geom_point(mapping = aes(color = Lightness), size = 2.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
## Warning: `stat_contour()`: Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning: `stat_contour()`: Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning: `stat_contour()`: Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning: `stat_contour()`: Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning: `stat_contour()`: Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
#### Interpretations of the HSL graph

The HSL graph suggests that colors with higher predicted probability of popularity have: 1. H: 15 to 30; S: bright; L: dark/soft/pale. 2. H: 0 to 15; S: neutral; L: saturated, pale, midtone, soft, light.
Let's now try a tree-based method using the default tuning grid. Tree-based models do not have the same preprocessing requirements as other models. Train a random forest binary classifier by setting the method argument equal to "rf". You must set importance = TRUE in the caret::train() function call. Assign the result to the rf_default variable and display the rf_default object to the screen.
The random forest model is trained and tuned below. The formula interface uses the . operator to streamline selecting all inputs in the df3_caret dataframe.
set.seed(1234)
rf_default <- train( outcome ~ .,
data = df3_caret,
method = 'rf',
metric = my_metric_p3,
trControl = my_ctrl,
importance = TRUE)
rf_default
## Random Forest
##
## 835 samples
## 6 predictor
## 2 classes: 'popular', 'unpopular'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8259860 0.3792716
## 9 0.8546397 0.5585412
## 16 0.8578284 0.5708417
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 16.
The default random forest achieves accuracy 0.8578284 and Kappa 0.5708417 at mtry = 16.
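Because importance = TRUE was set, variable importance scores can be extracted from the trained random forest and plotted. A sketch, assuming rf_default is in memory:

```r
# Permutation-based importance of each (dummy-encoded) input
rf_imp <- varImp(rf_default)
rf_imp

# Dotplot of the importance scores, most important inputs on top
plot(rf_imp)
```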
Let's make predictions on the visualization grid, viz_grid, using the random forest model rf_default. Instruct the predict() function to return class probabilities by setting type = 'prob'.
pred_viz_rf_probs <- predict( rf_default, newdata = viz_grid, type = 'prob' )
The pred_viz_rf_probs dataframe is column-bound to the viz_grid dataframe. The new object, viz_rf_df, provides the class predicted probabilities for each input combination in the visualization grid according to the random forest model.
viz_rf_df <- viz_grid %>% bind_cols(pred_viz_rf_probs)
viz_rf_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular <dbl> 0.030, 0.034, 0.012, 0.130, 0.758, 0.252, 0.678, 0.202, 0.3…
## $ unpopular <dbl> 0.970, 0.966, 0.988, 0.870, 0.242, 0.748, 0.322, 0.798, 0.6…
The glimpse reveals that the popular column stores the predicted event probability.
# Define custom breaks for the color channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin R, G, and B into ranges for faceting
viz_rf_df$R_range <- cut(viz_rf_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rf_df$G_range <- cut(viz_rf_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rf_df$B_range <- cut(viz_rf_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rf_df %>%
ggplot(mapping = aes(x = G, y = popular)) +
geom_point(mapping = aes(color = R),
size = 1.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
#### Interpretations of the RGB graph

The RGB graph suggests that colors with higher predicted probability of popularity have: 1. R: 0 to 100, G: 0 to 100, B: 0 to 50. 2. R: 0 to 200, G: 0 to 50, B: 51 to 100. 3. R: 0 to 200, G: 0 to 75, B: 101 to 150. 4. R: 0 to 255, G: 0 to 50, B: 151 to 200. 5. R: 0 to 255, G: 0 to 50, B: 201 to 255.
viz_rf_df %>%
ggplot(mapping = aes(x = Hue, y = popular)) +
geom_point(mapping = aes(color = Lightness), size = 1.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
The predictions generated by the random forest model fluctuate abruptly and inconsistently. This is expected from a tree-based method: decision trees do not produce smooth predictions.
# Note: caret's 'rf' method tunes only mtry; nodesize is not a tunable
# parameter through caret, so the custom grid is restricted to mtry.
rf_grid <- expand.grid(mtry = c(2, 4, 6))
Tuning the random forest:
# Determine the number of predictor variables
num_predictors <- ncol(df3_caret) - 1
# Reset the seed so the resample folds match the other models
set.seed(1234)
# Train the random forest model over mtry = 1 to the number of predictors
rf_tune <- train(outcome ~ R + G + B + Hue + Saturation + Lightness,
                 data = df3_caret,
                 method = 'rf',
                 metric = my_metric_p3,
                 tuneGrid = expand.grid(mtry = seq(1, num_predictors)),
                 preProcess = c("center", "scale"),
                 trControl = my_ctrl,
                 importance = TRUE)
# Print the best hyperparameters
rf_tune
## Random Forest
##
## 835 samples
## 6 predictor
## 2 classes: 'popular', 'unpopular'
##
## Pre-processing: centered (16), scaled (16)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 752, 752, 751, 751, 751, 751, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.7712590 0.0000000
## 2 0.8259636 0.3735576
## 3 0.8467233 0.4925533
## 4 0.8535362 0.5345724
## 5 0.8579586 0.5571733
## 6 0.8583507 0.5599688
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
Based on the accuracy metric, the tuned random forest model appears to perform better than the default random forest model.
plot(rf_default, xTrans=log)
plot(rf_tune, xTrans=log)
pred_viz_rft_probs <- predict( rf_tune, newdata = viz_grid, type = 'prob' )
The pred_viz_rft_probs dataframe is column-bound to the viz_grid dataframe. The new object, viz_rft_df, provides the class predicted probabilities for each input combination in the visualization grid according to the tuned random forest model.
viz_rft_df <- viz_grid %>% bind_cols(pred_viz_rft_probs)
viz_rft_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular <dbl> 0.060, 0.130, 0.058, 0.122, 0.474, 0.208, 0.456, 0.170, 0.4…
## $ unpopular <dbl> 0.940, 0.870, 0.942, 0.878, 0.526, 0.792, 0.544, 0.830, 0.5…
The glimpse reveals that the popular column stores the predicted event probability for the tuned random forest.
# Define custom breaks for the color channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin R, G, and B into ranges for faceting
viz_rft_df$R_range <- cut(viz_rft_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rft_df$G_range <- cut(viz_rft_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rft_df$B_range <- cut(viz_rft_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rft_df %>%
ggplot(mapping = aes(x = G, y = popular)) +
geom_point(mapping = aes(color = R),
size = 1.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
viz_rft_df %>%
ggplot(mapping = aes(x = Hue, y = popular)) +
geom_point(mapping = aes(color = Lightness), size = 2.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
The Gradient Boosted Tree model is trained and tuned below.
# Set the seed for reproducibility
set.seed(1234)
# Train the Gradient Boosted Trees model
gbm_default <- train(outcome ~ .,
data = df3_caret,
method = 'gbm',
metric = "Accuracy",
trControl = my_ctrl,
verbose = FALSE)
# Print the default GBM model
gbm_default$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 8 100 3 0.1 10
gbm_default
## Stochastic Gradient Boosting
##
## 835 samples
## 6 predictor
## 2 classes: 'popular', 'unpopular'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8240157 0.3836788
## 1 100 0.8279700 0.4048482
## 1 150 0.8283810 0.4087492
## 2 50 0.8279748 0.4085514
## 2 100 0.8411756 0.4747118
## 2 150 0.8475298 0.5090054
## 3 50 0.8391577 0.4599565
## 3 100 0.8554810 0.5376428
## 3 150 0.8554809 0.5493598
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
The default GBM achieves accuracy 0.8554810; the final values used for the model were n.trees = 100, interaction.depth = 3, shrinkage = 0.1, and n.minobsinnode = 10.
pred_viz_gbm_probs <- predict( gbm_default, newdata = viz_grid, type = 'prob' )
viz_gbm_df <- viz_grid %>% bind_cols(pred_viz_gbm_probs)
viz_gbm_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular <dbl> 0.04296871, 0.16840976, 0.04159066, 0.03144592, 0.72002648,…
## $ unpopular <dbl> 0.95703129, 0.83159024, 0.95840934, 0.96855408, 0.27997352,…
# Define custom breaks for the color channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin R, G, and B into ranges for faceting
viz_gbm_df$R_range <- cut(viz_gbm_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbm_df$G_range <- cut(viz_gbm_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbm_df$B_range <- cut(viz_gbm_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbm_df %>%
ggplot(mapping = aes(x = G, y = popular)) +
geom_point(mapping = aes(color = R),
size = 1.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
viz_gbm_df %>%
ggplot(mapping = aes(x = Hue, y = popular)) +
geom_point(mapping = aes(color = Lightness), size = 1.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
# Reset the seed so the resample folds match the other models
set.seed(1234)
# Train and tune the Gradient Boosted Trees model with a custom grid
gbm_tune <- train(outcome ~ .,
                  data = df3_caret,
                  method = 'gbm',
                  metric = "Accuracy", # Use accuracy as the evaluation metric
                  trControl = my_ctrl,
                  tuneGrid = expand.grid(n.trees = c(100, 200, 300),
                                         interaction.depth = seq(1, num_predictors),
                                         shrinkage = c(0.01, 0.1, 0.3),
                                         n.minobsinnode = c(5, 10, 20)),
                  verbose = FALSE)
gbm_tune$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 101 200 6 0.1 5
The default GBM achieved accuracy 0.8554810 with n.trees = 100, interaction.depth = 3, shrinkage = 0.1, and n.minobsinnode = 10. The tuned GBM achieves accuracy 0.8599150 and Kappa 0.5796735 with final values n.trees = 200, interaction.depth = 6, shrinkage = 0.1, and n.minobsinnode = 5, slightly better than the default.
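With a consistent resampling scheme, the tuned neural network and tree-based models can be placed side by side with caret::resamples(). A sketch; this assumes rf_tune and gbm_tune used the same resample folds as nnet_tune, which requires the seed to have been reset before each train() call:

```r
# Collect resample results from the tuned complex models
complex_results <- resamples(list(NNET = nnet_tune,
                                  RF = rf_tune,
                                  GBM = gbm_tune))

# Distributions of Accuracy and Kappa across the shared resamples
summary(complex_results)

# Box-and-whisker comparison of resampled accuracy
bwplot(complex_results, metric = "Accuracy")
```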
plot(gbm_default, xTrans=log)
plot(gbm_tune, xTrans=log)
pred_viz_gbmt_probs <- predict( gbm_tune, newdata = viz_grid, type = 'prob' )
The pred_viz_gbmt_probs dataframe is column-bound to the viz_grid dataframe. The new object, viz_gbmt_df, provides the class predicted probabilities for each input combination in the visualization grid according to the tuned gradient boosted tree model.
viz_gbmt_df <- viz_grid %>% bind_cols(pred_viz_gbmt_probs)
viz_gbmt_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular <dbl> 0.042555821, 0.047380040, 0.056188569, 0.004684948, 0.31366…
## $ unpopular <dbl> 0.95744418, 0.95261996, 0.94381143, 0.99531505, 0.68633079,…
The glimpse reveals that the popular column stores the predicted event probability for the tuned Gradient Boosted Trees model.
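If hard class labels are needed rather than probabilities, a common convention is to threshold the event probability at 0.5; this sketch uses made-up probabilities with the same popular/unpopular column names as above:

```r
# Convert predicted event probabilities to class labels with a 0.5 threshold
probs <- data.frame(popular = c(0.04, 0.31, 0.72),
                    unpopular = c(0.96, 0.69, 0.28))
pred_class <- ifelse(probs$popular > 0.5, "popular", "unpopular")
pred_class  # "unpopular" "unpopular" "popular"
```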
# Define custom breaks for the RGB channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin the R, G, and B variables using the breaks and labels
viz_gbmt_df$R_range <- cut(viz_gbmt_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbmt_df$G_range <- cut(viz_gbmt_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbmt_df$B_range <- cut(viz_gbmt_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbmt_df %>%
ggplot(mapping = aes(x = G, y = popular)) +
geom_point(mapping = aes(color = R),
size = 1.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
viz_gbmt_df %>%
ggplot(mapping = aes(x = Hue, y = popular)) +
geom_point(mapping = aes(color = Lightness), size = 1.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
# Set seed for reproducibility
set.seed(1234)
gam_model <- train(
outcome ~ .,
data = df3_caret,
method = "gam",
metric = "Accuracy",
trControl = my_ctrl,
tuneGrid = expand.grid(
select = c(0.1, 0.2, 0.3), # Specify the smoothing parameter values to tune
method = "GCV.Cp" # Specify the method for tuning the smoothing parameter
)
)
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
# Print the best hyperparameters
gam_model$bestTune
## select method
## 1 0.1 GCV.Cp
The output above indicates that the best hyperparameters chosen by the tuning process are select = 0.1 and method = GCV.Cp. Here’s what each of these means:
select: In caret’s “gam” method this tuning parameter is actually a logical flag passed through to mgcv::gam(select = ...). When TRUE, mgcv adds an extra shrinkage penalty so that entire smooth terms can be penalized out of the model (a form of automatic feature selection); it is not a numeric smoothing level. Every non-zero value supplied here (0.1, 0.2, and 0.3) is coerced to TRUE, so the three candidates fit the same model, which is why their resampled results are identical below. method: This specifies the criterion mgcv uses to estimate the smoothing parameters. GCV.Cp applies generalized cross-validation (GCV) when the scale parameter is unknown and Mallows’ Cp/UBRE when it is known; alternatives include REML and ML. Here the tuning process used GCV.Cp.
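To see the select flag in action directly in mgcv, here is a minimal sketch with simulated data (the variable names and the noise predictor x2 are made up for illustration; assumes the mgcv package is installed):

```r
library(mgcv)

set.seed(42)
n <- 200
x1 <- runif(n)
x2 <- runif(n)
y <- rbinom(n, 1, plogis(2 * sin(2 * pi * x1)))  # x2 is pure noise

# select = TRUE adds a shrinkage penalty so that useless smooth terms
# can be penalized toward zero effective degrees of freedom (edf)
fit <- gam(y ~ s(x1) + s(x2), family = binomial,
           method = "GCV.Cp", select = TRUE)
summary(fit)$edf  # edf for s(x2) should shrink toward ~0
```

This is the mechanism caret toggles on or off; it does not set a numeric amount of smoothing.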
pred_viz_gam_probs <- predict( gam_model, newdata = viz_grid, type = 'prob' )
viz_gam_df <- viz_grid %>% bind_cols(pred_viz_gam_probs)
viz_gam_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular <dbl> 0.0032961452, 0.1341051392, 0.0026029970, 0.0003317157, 0.9…
## $ unpopular <dbl> 9.967039e-01, 8.658949e-01, 9.973970e-01, 9.996683e-01, 1.4…
The glimpse reveals that the popular column stores the predicted event probability. Then visualize the predicted event probability in a manner consistent with the viz_bayes_logpost_preds() function and the tuned elastic net model predictions: plot the predicted probability with respect to G, for each combination of B and R.
# Define custom breaks for the RGB channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin the R, G, and B variables using the breaks and labels
viz_gam_df$R_range <- cut(viz_gam_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gam_df$G_range <- cut(viz_gam_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gam_df$B_range <- cut(viz_gam_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gam_df %>%
ggplot(mapping = aes(x = G, y = popular)) +
geom_point(mapping = aes(color = R), size = 2.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() + # Add contour lines to represent point density
theme_bw()
viz_gam_df %>%
ggplot(mapping = aes(x = Hue, y = popular)) +
geom_point(mapping = aes(color = Lightness), size = 2.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
Tuning Generalized Additive Models (GAM):
tuning_results <- gam_model$results
print(tuning_results)
## select method Accuracy Kappa AccuracySD KappaSD
## 1 0.1 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822
## 2 0.2 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822
## 3 0.3 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822
# Plot tuning results
ggplot(tuning_results, aes(x = select, y = Accuracy)) +
geom_line() + # Line plot
geom_point() + # Add points
labs(x = "Smoothing Parameter Value (select)", y = "Accuracy") +
ggtitle("Tuning Results for Generalized Additive Model (GAM)") +
theme_minimal()
Summary: - Accuracy: 0.8622984 - Kappa: 0.5755467 - Accuracy Standard Deviation (SD): 0.03498198 - Kappa Standard Deviation (SD): 0.1087822
Since all three variants have identical accuracy and kappa values, there is no difference in performance among them based on these metrics. This is expected: caret treats select as a logical flag, so every non-zero value fits the same underlying model.
# Set seed for reproducibility
set.seed(1234)
gam_tune <- train(
outcome ~ .,
data = df3_caret,
method = "gam",
metric = "Accuracy",
trControl = my_ctrl,
tuneGrid = expand.grid(
select = c(0.3, 3, 6), # Specify the smoothing parameter values to tune
method = "GCV.Cp" # Specify the method for tuning the smoothing parameter
)
)
# Print the best hyperparameters
gam_tune$results
## select method Accuracy Kappa AccuracySD KappaSD
## 1 0.3 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822
## 2 3.0 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822
## 3 6.0 GCV.Cp 0.8622984 0.5755467 0.03498198 0.1087822
Based on the results for the tuned GAM model, all three variants (with select values of 0.3, 3.0, and 6.0) again have identical accuracy and kappa values. As with the first grid, every non-zero select value is coerced to the same logical flag, so the three candidates fit an identical model and no performance difference is possible.
The accuracy for the tuned GAM model is 0.8622984, the same before and after tuning.
pred_viz_gamt_probs <- predict( gam_tune, newdata = viz_grid, type = 'prob' )
viz_gamt_df <- viz_grid %>% bind_cols(pred_viz_gamt_probs)
viz_gamt_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular <dbl> 0.0032961452, 0.1341051392, 0.0026029970, 0.0003317157, 0.9…
## $ unpopular <dbl> 9.967039e-01, 8.658949e-01, 9.973970e-01, 9.996683e-01, 1.4…
# Define custom breaks for the RGB channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin the R, G, and B variables using the breaks and labels
viz_gamt_df$R_range <- cut(viz_gamt_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gamt_df$G_range <- cut(viz_gamt_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gamt_df$B_range <- cut(viz_gamt_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gamt_df %>%
ggplot(mapping = aes(x = G, y = popular)) +
geom_point(mapping = aes(color = R), size = 2.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() + # Add contour lines to represent point density
theme_bw()
viz_gamt_df %>%
ggplot(mapping = aes(x = Hue, y = popular)) +
geom_point(mapping = aes(color = Lightness), size = 2.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
K-Nearest Neighbors (KNN) is a simple yet effective algorithm used for both classification and regression tasks. It works by identifying the K nearest data points to a given query point and predicting the class or value based on the most common class or average value among its neighbors, respectively.
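The core idea can be illustrated in a few lines of base R with a made-up one-dimensional training set:

```r
# Toy KNN classifier: majority vote among the K closest training points
train_x <- c(1.0, 1.2, 3.8, 4.1, 4.3)
train_y <- c("popular", "popular", "unpopular", "unpopular", "unpopular")
query   <- 3.5
K <- 3

dists   <- abs(train_x - query)     # Euclidean distance in 1-D
nearest <- order(dists)[1:K]        # indices of the K closest points
votes   <- table(train_y[nearest])  # count the labels among neighbors
names(votes)[which.max(votes)]      # majority label: "unpopular"
```

With continuous inputs on different scales (as with R, G, B, and Hue here), distances are dominated by the widest-ranging variable, which is why centering and scaling is often recommended before KNN.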
We now train the KNN model using the train function from the caret package. We specify the method as “knn” and use accuracy as the evaluation metric.
set.seed(1234)
knn_default <- train(outcome ~ .,
data = df3_caret,
method = "knn",
trControl = my_ctrl,
tuneLength = 10) # Evaluate 10 candidate values of k
knn_default
## k-Nearest Neighbors
##
## 835 samples
## 6 predictor
## 2 classes: 'popular', 'unpopular'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8075931 0.4400930
## 7 0.8155969 0.4443954
## 9 0.8032517 0.3946884
## 11 0.8036440 0.3817466
## 13 0.7976579 0.3550789
## 15 0.7951959 0.3369989
## 17 0.7991455 0.3349957
## 19 0.7959851 0.3105241
## 21 0.7951911 0.2982999
## 23 0.7915671 0.2766259
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
The default KNN search selected k = 7 with an accuracy of 0.8155969.
From the above outcomes we can see that the k-Nearest Neighbors (KNN) algorithm was applied to classify samples into ‘popular’ and ‘unpopular’ classes using a dataset with 835 samples and 6 predictor variables. Through cross-validated performance evaluation, KNN was tuned over a range of k values from 5 to 23. The optimal model achieved an accuracy of approximately 81.56% with k = 7, meaning it correctly predicts the class label for about 81.56% of the resampled cases. KNN’s simplicity and effectiveness make it a valuable tool for classification tasks, especially when dealing with relatively small datasets.
# Predictions to understand the behavior of the default KNN model
pred_viz_knn_probs <- predict(knn_default, newdata = viz_grid, type = 'prob')
# Combine predictions with visualization grid
viz_knn_df <- viz_grid %>% bind_cols(pred_viz_knn_probs)
# Display a glimpse of the combined dataframe
viz_knn_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular <dbl> 0.1428571, 0.4285714, 0.7142857, 0.1428571, 0.1428571, 0.14…
## $ unpopular <dbl> 0.8571429, 0.5714286, 0.2857143, 0.8571429, 0.8571429, 0.85…
Plot for knn_default
Let’s analyze it with a graph.
plot(knn_default,main="KNN Default Plot", xTrans=log)
Because xTrans = log is supplied, the x-axis shows log(#Neighbors), so the candidate k values of 5 through 23 appear between roughly 1.6 and 3.1. Accuracy rises from k = 5 to its peak at k = 7 (log(7) ≈ 1.95) and then generally declines as k grows, reaching its lowest values near k = 23. This pattern reflects the usual trade-off in KNN between model flexibility and over-smoothing: small k fits local structure closely, while large k averages over too many neighbors and washes out the decision boundary. The optimal balance here is at k = 7.
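The mapping between the candidate k values and their positions on the log-transformed axis can be checked directly in base R:

```r
# Candidate k values from the default tuneLength = 10 search
k <- seq(5, 23, by = 2)
round(log(k), 2)
# 1.61 1.95 2.20 2.40 2.56 2.71 2.83 2.94 3.04 3.14
```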
Tuning KNN
# Define a tuning grid for KNN
knn_grid <- expand.grid(k = seq(1, 20, by = 2)) # Define the range of K values
set.seed(1234)
knn_tune <- train(outcome ~ .,
data = df3_caret,
method = "knn",
trControl = my_ctrl,
tuneGrid = knn_grid)
knn_tune
## k-Nearest Neighbors
##
## 835 samples
## 6 predictor
## 2 classes: 'popular', 'unpopular'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 751, 751, 751, 752, 752, 752, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.7984282 0.4366710
## 3 0.8095540 0.4562571
## 5 0.8087932 0.4434316
## 7 0.8155969 0.4443954
## 9 0.8032565 0.3952688
## 11 0.8032472 0.3809159
## 13 0.7976579 0.3550789
## 15 0.7955975 0.3387744
## 17 0.7987487 0.3335358
## 19 0.7967883 0.3138451
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
The tuned KNN accuracy is 0.8155969 with k = 7, matching the value selected by the default search.
Predictions to understand the behavior of the tuned KNN model
# Predictions to understand the behavior of the tuned KNN model
pred_viz_knn_tune_probs <- predict(knn_tune, newdata = viz_grid, type = 'prob')
# Combine predictions with visualization grid
viz_knn_tune_df <- viz_grid %>% bind_cols(pred_viz_knn_tune_probs)
# Display a glimpse of the combined dataframe
viz_knn_tune_df %>% glimpse()
## Rows: 784
## Columns: 8
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular <dbl> 0.1428571, 0.4285714, 0.7142857, 0.1428571, 0.1428571, 0.14…
## $ unpopular <dbl> 0.8571429, 0.5714286, 0.2857143, 0.8571429, 0.8571429, 0.85…
Plotting Tuned model
# Plot the performance of the tuned KNN model
plot(knn_tune, main="KNN Tuned Plot", xTrans=log)
Predictions
# Predictions for the default KNN model
pred_viz_knn_probs_default <- predict(knn_default, newdata = viz_grid, type = 'prob')
# Combine predictions with the visualization grid
viz_knn_df_default <- bind_cols(viz_grid, as.data.frame(pred_viz_knn_probs_default))
# Glimpse the combined dataframe
glimpse(viz_knn_df_default)
## Rows: 784
## Columns: 8
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular <dbl> 0.1428571, 0.4285714, 0.7142857, 0.1428571, 0.1428571, 0.14…
## $ unpopular <dbl> 0.8571429, 0.5714286, 0.2857143, 0.8571429, 0.8571429, 0.85…
# Predictions for the tuned KNN model
pred_viz_knn_probs_tune <- predict(knn_tune, newdata = viz_grid, type = 'prob')
# Combine predictions with the visualization grid
viz_knn_df_tune <- bind_cols(viz_grid, as.data.frame(pred_viz_knn_probs_tune))
# Glimpse the combined dataframe
glimpse(viz_knn_df_tune)
## Rows: 784
## Columns: 8
## $ R <int> 105, 156, 110, 4, 243, 222, 92, 45, 46, 199, 230, 109, 189,…
## $ G <int> 189, 193, 114, 128, 27, 110, 24, 240, 21, 23, 89, 16, 220, …
## $ B <int> 92, 163, 103, 218, 162, 56, 161, 134, 131, 205, 15, 150, 20…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 18, 8, 17, 1, 16, 28, 3, 16, 0, 2, 21, 13, 11, 9, 8, 19, 16…
## $ popular <dbl> 0.1428571, 0.4285714, 0.7142857, 0.1428571, 0.1428571, 0.14…
## $ unpopular <dbl> 0.8571429, 0.5714286, 0.2857143, 0.8571429, 0.8571429, 0.85…
Visualization
viz_knn_df_default %>%
ggplot(mapping = aes(x = Hue, y = popular, color=B)) +
geom_point(size = 3) +
facet_wrap(~Lightness, labeller = 'label_both') +
scale_color_gradient(low = "#00FFFF", high = "#104E8B") +
theme_bw()
Tuned Predicted Probability of outcome for KNN
viz_grid %>%
bind_cols(predict(knn_tune, newdata = viz_grid, type = 'prob')) %>%
ggplot(mapping = aes(x = Hue, y = popular, color = B)) +
geom_point(size = 3) +
facet_wrap(~Lightness, labeller = 'label_both') +
scale_color_gradient(low = "#00FFFF", high = "#104E8B") +
theme_bw()
For most of the models trained (neural network, random forest, GBM, GAM, and KNN), I used Accuracy as the performance metric; I was unaware of the ROC AUC metric until I had completed all the models and moved on to the next slide.
By combining accuracy as the evaluation metric with k-fold cross-validation as the resampling scheme, you ensure that: - The model is evaluated based on its ability to correctly classify instances. - The model’s performance is assessed robustly across multiple train-test splits of the data, reducing the risk of overfitting or underfitting.
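That pairing can be sketched in a few lines of base R; the toy data and the trivial majority-class “model” below are made up for illustration and stand in for the real caret pipeline:

```r
# Manual k-fold cross-validated accuracy with a majority-class baseline
set.seed(1234)
y <- factor(sample(c("popular", "unpopular"), 100,
                   replace = TRUE, prob = c(0.3, 0.7)))
k <- 10
folds <- sample(rep(1:k, length.out = length(y)))  # random fold assignment

fold_acc <- sapply(1:k, function(i) {
  train_y <- y[folds != i]
  test_y  <- y[folds == i]
  # "model": always predict the majority class of the training fold
  pred <- names(which.max(table(train_y)))
  mean(test_y == pred)                   # accuracy on the held-out fold
})
mean(fold_acc)                           # cross-validated accuracy estimate
```

caret’s trainControl automates exactly this loop (including the repeats) and averages the fold-level metrics for every candidate hyperparameter combination.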
KNN Model: - The tuned KNN model achieved an accuracy of 0.8155969 with k = 7, slightly higher than the default.
GAM Model: - The accuracy for the tuned GAM model is 0.8622984. There was no change in accuracy before and after tuning.
Neural Network Models: - Default Neural Network: - Size = 3, Decay = 1e-01 - Accuracy: 0.8331720, Kappa: 0.4839001
Random Forest Models: - Default Random Forest: - Accuracy: 0.8578284, Kappa: 0.5708417 (mtry = 16)
GBM Model: - Default GBM: - Accuracy: 0.8554810 - Final values: n.trees = 100, interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10
library(ggplot2)
# Create a data frame with model names and accuracy values
data <- data.frame(Model = c("KNN", "GAM", "Neural Network", "Random Forest", "GBM"),
Accuracy = c(0.8155969, 0.8622984, 0.8363562, 0.8583507, 0.8599150))
# Create the ggplot
ggplot(data, aes(x = Model, y = Accuracy)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = round(Accuracy, 4)), vjust = -0.3) + # Add text labels for accuracy values
labs(title = "Accuracy of Different Models", x = "Models", y = "Accuracy") +
coord_cartesian(ylim = c(0.8, 0.9)) + # Zoom the y-axis without dropping the bars (ylim() would remove them, since bars start at 0)
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
gam_tune %>% readr::write_rds("gam_tune.rds")